Abstract. We propose a method to exponentially speed up the computation of various
fingerprints, such as the ones used to compute similarity and rarity in massive data sets.
Rather than maintaining the full stream of $b$ items of a universe $[u]$, such methods only
maintain a concise fingerprint of the stream, and perform computations using the
fingerprints. The computations are done approximately, and the required fingerprint size
$k$ depends on the desired accuracy $\epsilon$ and confidence $\delta$. Our technique
maintains a single bit per hash function, rather than a single integer, thus requiring a
fingerprint of length $k = O(\frac{\ln(1/\delta)}{\epsilon^2})$ bits, rather than the
$O(\log u \cdot \frac{\ln(1/\delta)}{\epsilon^2})$ bits required by previous approaches.
The main advantage of the fingerprints we propose is that rather than computing the
fingerprint of a stream of $b$ items in time $O(b \cdot k)$, we can compute it in time
$O(b \log k)$. This allows an exponential speedup for the fingerprint construction, or
alternatively allows achieving a much higher accuracy while preserving computation time.
Our methods rely on a specific family of pseudo-random hashes for which we can quickly
locate hashes resulting in small values.

1 Introduction 

Hashing is a key tool in processing massive data sets. Many uses of hashing in various
applications require computing many hash functions in parallel. In this paper we present a
technique that "ties together" many hashes in a novel way, which enables us to speed up
such algorithms by an exponential factor. Our method also works for some complicated hash
functions such as min-wise independent families of hashes. In this paper we focus on
producing an optimal similarity fingerprint using this method, but our technique is
general, as it is easy to use our approach to speed up other hash-intensive computations.
One easy example where our technique applies is approximating the number of distinct
elements from (T). Another example, which requires a slightly stronger analysis, is
computing $L_p$ sketches [13] for $0 < p < 2$.

Min-wise independent families of hash functions, which we call MWIFs for short, were
introduced in [16, 6]. Computations using MWIFs have been used in many algorithms for
processing massive data streams. The properties of MWIFs allow maintaining concise
descriptions of massive streams. These descriptions, called "fingerprints" or "sketches",
allow computing properties of these streams and relations between them. Examples of such
"fingerprint" computations include data summarization and subpopulation-size queries
[9, 8], greedy list intersection [14], approximating rarity and similarity for data
streams [10], collaborative filtering fingerprints [4, 3, 2] and estimating frequency
moments [1]. Another motivation for studying MWIFs is reducing the amount of randomness
used by algorithms [17, 16, 6].



Recent research reduced the amount of information stored, while accurately computing
properties of data streams. Such techniques improve the space complexity, but
much less attention has been given to computation complexity. For example, many 
streaming algorithms compute huge amounts of hashes, as they apply many hashes to 
each element in a very long stream of elements. This leads to a high computation time, 
not always tractable for many applications. 

Our main contribution is a method allowing an exponential speedup in computation 
time for constructing fingerprints of massive data streams. Our technique is general, and 
can speed up many processes that apply many random hashes. The heart of the method 
lies in using a specific family of pseudo-random hashes shown to be approximately-MWIF
[12], and for which we can quickly locate the hashes resulting in a small value
of an element under the hash. Similarly to fTTl we use the fact that members of the 
family are pairwise independent between themselves. We also extend the technique and 
show one can maintain just a single bit rather than the full element IDs, thus improving 
the fingerprint size. Independently of us, [15] also considered storing a few bits per hash
function, but focused only on minimizing storage rather than computation time. 

1.1 Preliminaries 

Let $H$ be a family of functions over the same source $X$ and target $Y$, so each
$h \in H$ is a function $h : X \to Y$, where $Y$ is a completely ordered set. We say that
$H$ is min-wise independent if, when randomly choosing a function $h \in H$, for any subset
$C \subseteq X$, any $x \in C$ has an equal probability of being the minimal after
applying $h$.

Definition 1. $H$ is min-wise independent (MWIF), if for all $C \subseteq X$, for any $x \in C$,

$$\Pr_{h \in H}[h(x) = \min_{a \in C} h(a)] = \frac{1}{|C|}$$

Definition 2. $H$ is $\gamma$-approximately min-wise independent ($\gamma$-MWIF), if for all
$C \subseteq X$, for any $x \in C$,

$$\left| \Pr_{h \in H}[h(x) = \min_{a \in C} h(a)] - \frac{1}{|C|} \right| \le \frac{\gamma}{|C|}$$

Definition 3. $H$ is $k$-wise independent, if for all $x_1, x_2, \ldots, x_k, y_1, y_2, \ldots, y_k \in X$,

$$\Pr_{h \in H}[(h(x_1) = y_1) \wedge \ldots \wedge (h(x_k) = y_k)] = \frac{1}{|X|^k}$$



2 Pseudo-Random Family of Hashes 

We describe the hashes we use. Given the universe of item IDs $[u]$, consider a big prime
$p$, such that $p > u$. Consider taking random coefficients for a $d$-degree polynomial in
$Z_p$. Let $a_0, a_1, \ldots, a_d \in [p]$ be chosen uniformly at random from $[p]$, and
consider the following polynomial in $Z_p$: $f(x) = a_0 + a_1 x + a_2 x^2 + \ldots + a_d x^d$.
We denote by $F_d$ the family of all $d$-degree polynomials in $Z_p$ with coefficients in
$Z_p$, and later choose members of this family uniformly at random. Indyk [12] shows that
choosing a function $f$ from $F_d$ uniformly at random results in $F_d$ being a
$\gamma$-MWIF for $d = O(\log\frac{1}{\gamma})$.

Randomly choosing $a_0, \ldots, a_d$ is equivalent to choosing a member of $F_d$ uniformly
at random, so $f(x) = a_0 + a_1 x + a_2 x^2 + \ldots + a_d x^d$ is a hash chosen at random
from the $\gamma$-MWIF $F_d$. Similarly, let $b_0, b_1, \ldots, b_d \in [p]$ be chosen
uniformly at random from $[p]$, and let $g(x) = b_0 + b_1 x + b_2 x^2 + \ldots + b_d x^d$,
which is also a hash chosen at random from the $\gamma$-MWIF $F_d$. Now consider the hashes
$h_0(x) = f(x)$, $h_1(x) = f(x) + g(x)$, $h_2(x) = f(x) + 2g(x)$, $\ldots$,
$h_i(x) = f(x) + i g(x)$, $\ldots$, $h_{k-1}(x) = f(x) + (k-1) g(x)$. We call this random
construction procedure for $f(x), g(x)$ the base random construction, and the construction
of $h_i$ the composition construction. We prove properties of such hashes. We denote the
probability of an event $E$ when the hash $h$ is constructed by choosing $f, g$ using the
base random construction and composing $h(x) = f(x) + i \cdot g(x)$ (for some $i \in [p]$)
as $\Pr_h(E)$.
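The two constructions above can be sketched in a few lines. This is only an illustration under assumptions: the prime `p`, the degree `d`, and the helper names `random_poly`, `eval_poly` and `composed_hashes` are hypothetical choices made for readability, not part of the paper.

```python
import random

def random_poly(d, p, rng):
    """Base random construction: d+1 coefficients drawn uniformly from Z_p."""
    return [rng.randrange(p) for _ in range(d + 1)]

def eval_poly(coeffs, x, p):
    """Evaluate a_0 + a_1*x + ... + a_d*x^d over Z_p with Horner's rule."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % p
    return acc

def composed_hashes(f, g, k, x, p):
    """Composition construction: (h_0(x), ..., h_{k-1}(x)), h_i = f + i*g mod p."""
    a, b = eval_poly(f, x, p), eval_poly(g, x, p)
    return [(a + i * b) % p for i in range(k)]

# Toy parameters (assumed for illustration only).
rng = random.Random(0)
p, d, k = 101, 3, 5
f, g = random_poly(d, p, rng), random_poly(d, p, rng)
values = composed_hashes(f, g, k, x=7, p=p)
```

Note that all $k$ hash values of one item are obtained from only two polynomial evaluations, which is what the later sections exploit.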

Lemma 1 (Uniform Minimal Values). Let $f, g$ be constructed using the base random
construction, using $d = O(\log\frac{1}{\gamma})$. For any $z \in [u]$, any
$X \subseteq [u]$ and any value $i$ used to compose $h(x) = f(x) + i \cdot g(x)$:
$\Pr_h[h(z) < \min_{y \in X} h(y)] = (1 \pm \gamma)\frac{1}{|X|}$.

Proof. Fix $i$, $z \in [u]$ and $X \subseteq [u]$, construct $f, g$ using the base random
construction, and compose $h(x) = f(x) + i \cdot g(x)$. Note that in
$i \cdot g(x) = i \cdot (b_0 + b_1 x + \ldots + b_d x^d)$, the coefficient of $x^j$ is
$q = (i \cdot b_j) \bmod p$. Given a value $s \in [p]$, there is exactly one value
$r \in [p]$ such that $(q + r) \bmod p = s$. Thus, for any $s \in [p]$, the probability that
the coefficient of $x^j$ in $h(x)$ is $s$ is $\frac{1}{p}$. Therefore, for any polynomial
$p(x) \in F_d$, $\Pr_h[h(x) \equiv p(x)] = \frac{1}{p^{d+1}} = \frac{1}{|F_d|}$. We have:

$$\Pr_h[h(z) < \min_{y \in X} h(y)] = \sum_{p(x) \in F_d} \Pr_h[h(z) < \min_{y \in X} h(y) \mid h(x) \equiv p(x)] \cdot \Pr_h[h(x) \equiv p(x)]$$

$$= \sum_{p(x) \in F_d} \frac{\Pr[p(z) < \min_{y \in X} p(y)]}{|F_d|} = \Pr_{p(x) \in F_d}[p(z) < \min_{y \in X} p(y)] = (1 \pm \gamma)\frac{1}{|X|}$$

If $p$ is a polynomial such that $p(z) < \min_{y \in X} p(y)$, then we have
$\Pr[p(z) < \min_{y \in X} p(y)] = 1$, and otherwise $\Pr[p(z) < \min_{y \in X} p(y)] = 0$.
Thus we get $\sum_{p(x) \in F_d} \frac{\Pr[p(z) < \min_{y \in X} p(y)]}{|F_d|} = \Pr_{p(x) \in F_d}[p(z) < \min_{y \in X} p(y)]$.
The last transition uses the fact that $F_d$ is a $\gamma$-MWIF, which requires
$d = O(\log\frac{1}{\gamma})$.

Lemma 2 (Pairwise Interaction). Let $f, g$ be constructed using the base random
construction, using $d = O(\log\frac{1}{\gamma})$. For all $x_1, x_2 \in [u]$ and all
$X_1, X_2 \subseteq [u]$, and $i \ne j$ used to compose $h_i(x) = f(x) + i \cdot g(x)$ and
$h_j(x) = f(x) + j \cdot g(x)$:

$$\Pr_{f,g \in F_d}[(h_i(x_1) < \min_{y \in X_1} h_i(y)) \wedge (h_j(x_2) < \min_{y \in X_2} h_j(y))] = (1 \pm \gamma)^2 \frac{1}{|X_1| \cdot |X_2|}$$



Proof. Given $p_1(x) = u_0 + u_1 x + \ldots + u_d x^d \in F_d$ and
$p_2(x) = v_0 + v_1 x + \ldots + v_d x^d \in F_d$, there is exactly one pair of polynomials
$f(x), g(x) \in F_d$ such that both $f(x) + i \cdot g(x) = p_1(x)$ and
$f(x) + j \cdot g(x) = p_2(x)$. Each coefficient location $l \in \{0, \ldots, d\}$ results
in two equations with two unknowns in $Z_p$, with a single solution $(a_l, b_l)$ (where
$a_l$ is the coefficient of $x^l$ in $f(x)$, and $b_l$ is the coefficient of $x^l$ in $g(x)$).
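The uniqueness of the pair $(a_l, b_l)$ can be checked numerically. The following sketch assumes a small prime `p` and a hypothetical helper `solve_pair`; it solves the two equations $a + i b = u$ and $a + j b = v$ over $Z_p$ for one coefficient location.

```python
def solve_pair(u, v, i, j, p):
    """Solve a + i*b = u and a + j*b = v over Z_p (p prime, i != j mod p)."""
    inv = pow((i - j) % p, -1, p)   # modular inverse of (i - j), Python 3.8+
    b = (u - v) * inv % p
    a = (u - i * b) % p
    return a, b

# Toy values (assumed for illustration only).
a, b = solve_pair(u=7, v=2, i=1, j=3, p=11)
```

Subtracting the two equations gives $(i - j) b = u - v$, which has exactly one solution because $i - j$ is invertible modulo the prime $p$.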

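Fix $i \ne j$, $x_1, x_2 \in [u]$ and $X_1, X_2 \subseteq [u]$, construct $f, g$ using the
base random construction, and compose $h_i(x) = f(x) + i \cdot g(x)$,
$h_j(x) = f(x) + j \cdot g(x)$. For brevity, denote $m_1^i = \min_{y \in X_1} h_i(y)$.
Similarly, denote $m_2^j = \min_{y \in X_2} h_j(y)$. We have:

$$\Pr_{f,g \in F_d}[(h_i(x_1) < m_1^i) \wedge (h_j(x_2) < m_2^j)]$$

$$= \sum_{p_1, p_2 \in F_d} \Pr[(h_i(x_1) < m_1^i) \wedge (h_j(x_2) < m_2^j) \mid (h_i(x) \equiv p_1(x) \wedge h_j(x) \equiv p_2(x))] \cdot \Pr[(h_i(x) \equiv p_1(x) \wedge h_j(x) \equiv p_2(x))]$$

$$= \sum_{p_1, p_2 \in F_d} \frac{\Pr[(h_i(x_1) < m_1^i) \wedge (h_j(x_2) < m_2^j) \mid (h_i(x) \equiv p_1(x) \wedge h_j(x) \equiv p_2(x))]}{|F_d|^2}$$

Thus, with the minima evaluated under $p_1$ and $p_2$ respectively,

$$\Pr_{f,g \in F_d}[(h_i(x_1) < m_1^i) \wedge (h_j(x_2) < m_2^j)] = \sum_{p_1, p_2 \in F_d} \frac{\Pr[p_1(x_1) < m_1^i] \cdot \Pr[p_2(x_2) < m_2^j]}{|F_d|^2}$$

$$= \sum_{p_1 \in F_d} \sum_{p_2 \in F_d} \frac{\Pr[p_1(x_1) < m_1^i]}{|F_d|} \cdot \frac{\Pr[p_2(x_2) < m_2^j]}{|F_d|} = (1 \pm \gamma)^2 \frac{1}{|X_1| \cdot |X_2|}$$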

3 Fingerprinting Using Pseudo-Random Hashes 

Several methods were suggested for building fingerprints for approximating relations
between massive datasets, such as the Jaccard similarity (see [6] for example). Given a
universe $U$, where $|U| = u$, consider $C_1, C_2$, where each $C_i \subseteq U$ is
described as a set of $|C_i|$ integers in $[u]$ (we use $[u]$ to denote
$\{1, 2, \ldots, u\}$). The Jaccard similarity is
$J_{1,2} = \frac{|C_1 \cap C_2|}{|C_1 \cup C_2|}$. Many fingerprints rely on applying many
hashes to each element in the long streams. We use the hashes of Section 2 to
exponentially speed up such computations. We use pseudo-random effects in this hash, so we
must relax the MWIF requirement to a pairwise independence requirement (2-wise
independence).

For completeness, we briefly consider previously suggested approaches for approximating
Jaccard similarity [6]. Let $h \in H$ be a randomly chosen function from a MWIF $H$. We can
apply $h$ on all elements of $C_1$ and examine the minimal integer we get,
$m_1^h = \arg\min_{x \in C_1} h(x)$. We can do the same to $C_2$ and examine
$m_2^h = \arg\min_{x \in C_2} h(x)$. Fingerprints for estimating the Jaccard similarity are
based on computing the probability that $m_1^h = m_2^h$:
$\Pr_{h \in H}[m_1^h = m_2^h] = \Pr_{h \in H}[\arg\min_{x \in C_1} h(x) = \arg\min_{x \in C_2} h(x)]$.

Theorem 1 (Jaccard and MWIF Collision Probability). $\Pr_{h \in H}[m_i^h = m_j^h] = J_{i,j}$.
The proof is given in [6], and in the appendix for completeness.
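Theorem 1 can be verified exhaustively on a tiny universe. The sketch below assumes the family of all permutations of the universe (which is min-wise independent) and uses hypothetical helper names; it is a sanity check, not part of the fingerprinting scheme.

```python
from fractions import Fraction
from itertools import permutations

def jaccard(c1, c2):
    """Exact Jaccard similarity |c1 & c2| / |c1 | c2|."""
    return Fraction(len(c1 & c2), len(c1 | c2))

def collision_probability(c1, c2, universe):
    """Fraction of permutations h with argmin over c1 equal to argmin over c2."""
    hits = total = 0
    for perm in permutations(range(len(universe))):
        h = dict(zip(universe, perm))       # h maps each item to its rank
        total += 1
        if min(c1, key=h.get) == min(c2, key=h.get):
            hits += 1
    return Fraction(hits, total)

# Toy sets (assumed for illustration only).
universe = [1, 2, 3, 4, 5]
c1, c2 = {1, 2, 3}, {2, 3, 4, 5}
prob = collision_probability(c1, c2, universe)
```

Here the two minima coincide exactly when the minimum of $C_1 \cup C_2$ lies in $C_1 \cap C_2$, so the collision frequency equals the Jaccard similarity.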

Similarly, regarding a hash $h$ from a $\gamma$-MWIF, [5, 6] shows that:

Theorem 2. $|\Pr_{h \in H}[m_i^h = m_j^h] - J_{i,j}| \le \gamma$.

Rather than maintaining the full $C_i$'s, previous approaches [5, 6] suggest maintaining
their fingerprints. Given $k$ hashes $h_1, \ldots, h_k$ randomly chosen from a
$\gamma$-MWIF, we can maintain $m_i^{h_1}, \ldots, m_i^{h_k}$. Given $C_i, C_j$, for any
$x \in [k]$, the probability that $m_i^{h_x} = m_j^{h_x}$ is $J_{i,j} \pm \gamma$. A hash
$h_x$ where we have $m_i^{h_x} = m_j^{h_x}$ is called a hash collision. We can thus estimate
$J_{i,j}$ by counting the proportion of collision hashes out of all the chosen hashes. In
this approach, the fingerprint contains $k$ item identities in $[u]$, since for any $x$,
$m_i^{h_x}$ is in $[u]$. Thus, such a fingerprint requires $k \log u$ bits. To achieve an
accuracy $\epsilon$ and confidence $\delta$, such approaches require
$k = O(\frac{\ln(1/\delta)}{\epsilon^2})$. Our basis for the fingerprint is a "block
fingerprint" which allows approximating $J_{i,j}$ with a given accuracy $\epsilon$ and a
confidence of $\frac{7}{8}$. This block fingerprint maintains only a single bit per hash,
as opposed to previous approaches which maintain $\log u$ bits per hash. Later we show how
to achieve a given accuracy $\epsilon$ with a given confidence $\delta$, by combining
several block fingerprints, and creating a full fingerprint.

To shorten the fingerprints using a single bit per hash, we use a hash mapping elements in
$[u]$ to a single bit, $\phi : [u] \to \{0, 1\}$, taken from a pairwise independent family
(PWIF for short) of such hashes. Rather than defining $m_i^h = \arg\min_{x \in C_i} h(x)$
we define $m_i^{\phi,h} = \phi(\arg\min_{x \in C_i} h(x))$. Maintaining $m_i^{\phi,h}$
rather than $m_i^h$ shortens the fingerprint by a factor of $\log u$. We examine the
resulting accuracy and confidence.



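Theorem 3. $\Pr_{h \in H}[m_i^{\phi,h} = m_j^{\phi,h}] = \frac{J_{i,j}}{2} + \frac{1}{2} \pm \frac{\gamma}{2}$

Proof.
$\Pr_{h \in H, \phi \in H'}[m_i^{\phi,h} = m_j^{\phi,h}] = \Pr[m_i^{\phi,h} = m_j^{\phi,h} \mid m_i^h = m_j^h] \cdot \Pr_{h \in H}[m_i^h = m_j^h] + \Pr[m_i^{\phi,h} = m_j^{\phi,h} \mid m_i^h \ne m_j^h] \cdot \Pr_{h \in H}[m_i^h \ne m_j^h]$
$= 1 \cdot \Pr_{h \in H}[m_i^h = m_j^h] + \frac{1}{2} \cdot (1 - \Pr_{h \in H}[m_i^h = m_j^h]) = \frac{1}{2} + \frac{1}{2}\Pr_{h \in H}[m_i^h = m_j^h] = \frac{1}{2} + \frac{J_{i,j}}{2} \pm \frac{\gamma}{2}$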

The purpose of the fingerprint block is to provide an approximation of $J$ with accuracy
$\epsilon$. We use $k$ hashes, and choose $k = \frac{8.02}{\epsilon^2}$. Denote
$\alpha = \frac{2^{10} - 1}{2^{10}}$ and let
$\gamma = (1 - \alpha) \cdot \epsilon = \frac{\epsilon}{2^{10}}$. We construct a
$\gamma$-MWIF (see footnote 3). To construct the family, consider choosing
$a_0, \ldots, a_d$ and $b_0, b_1, \ldots, b_d$ uniformly at random from $[p]$, constructing
the polynomials $f(x) = a_0 + a_1 x + a_2 x^2 + \ldots + a_d x^d$,
$g(x) = b_0 + b_1 x + b_2 x^2 + \ldots + b_d x^d$, and using the $k$ hashes
$h_i(x) = f(x) + i g(x)$, where $i \in \{0, 1, \ldots, k - 1\}$. We also use a hash
$\phi : [u] \to \{0, 1\}$ chosen from the PWIF of such hashes. For the two streams
$C_1, C_2$ being compared, we say there is a collision on $h_i$ if
$m_1^{\phi,h_i} = m_2^{\phi,h_i}$, and denote by $Z_i$ the random variable where $Z_i = 1$
if there is a collision on $h_i$ and $Z_i = 0$ if there is no such collision. $Z_i = 1$
with probability $\frac{1}{2} + \frac{J}{2} \pm \frac{\gamma}{2}$ and $Z_i = 0$ with
probability $\frac{1}{2} - \frac{J}{2} \pm \frac{\gamma}{2}$. Thus
$E(Z_i) = \frac{1}{2} + \frac{J}{2} \pm \frac{\gamma}{2}$. Denote $X_i = 2 Z_i - 1$. Then
$E(X_i) = 2 E(Z_i) - 1 = J \pm \gamma$. $X_i$ can take two values, $-1$ when $Z_i = 0$, and
$1$ when $Z_i = 1$. Thus $X_i^2$ always takes the value of $1$, so $E(X_i^2) = 1$. Consider
$X = \sum_{i=1}^{k} X_i$, and take $Y = \hat{J} = \frac{X}{k}$ as an estimator for $J$. We
show that for the above choice of $k$, $Y$ is accurate up to $\epsilon$ with probability of
at least $\frac{7}{8}$.
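The estimator $Y$ only needs the two single-bit block fingerprints. A minimal sketch, assuming the bits are given as Python lists of 0/1 values and using the hypothetical name `estimate_jaccard`:

```python
def estimate_jaccard(bits_1, bits_2):
    """Y = (1/k) * sum(X_i), where X_i = 2*Z_i - 1 and Z_i = 1 on a bit collision."""
    if len(bits_1) != len(bits_2) or not bits_1:
        raise ValueError("fingerprints must be non-empty and of equal length")
    x_sum = sum(2 * int(a == b) - 1 for a, b in zip(bits_1, bits_2))
    return x_sum / len(bits_1)

# Toy fingerprints (assumed for illustration only): 6 collisions out of 8 hashes.
y = estimate_jaccard([1, 0, 1, 1, 0, 0, 1, 0],
                     [1, 0, 0, 1, 0, 1, 1, 0])
```

With a collision rate $c$ the estimate is $2c - 1$, which inverts the relation $\frac{1}{2} + \frac{J}{2}$ of Theorem 3; for very dissimilar streams the estimate may come out slightly negative.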

Theorem 4 (Simple Estimator). $\Pr(|Y - J| \le \epsilon) \ge \frac{7}{8}$. Proof given in appendix.

Due to Theorem 4 we can approximate $J$ with accuracy $\epsilon$ and confidence
$\frac{7}{8}$ using a "block fingerprint" for $C_i$, composed of
$m_i^{\phi_1,h_1}, \ldots, m_i^{\phi_k,h_k}$, where $h_1, \ldots, h_k$ are randomly
constructed members of a $\gamma$-MWIF and $\phi_1, \ldots, \phi_k$ are chosen from the
PWIF of hashes $\phi : [u] \to \{0, 1\}$. We show that it suffices to take
$k = O(\frac{1}{\epsilon^2})$ to achieve this. Constructing each $h_i$ can be done by
choosing $f, g$ using the base random construction and composing
$h_i(x) = f(x) + i \cdot g(x)$. The base random construction chooses $f, g$ uniformly at
random from $F_d$, the family of $d$-degree polynomials in $Z_p$, where
$d = O(\log\frac{1}{\gamma})$. This achieves a $\gamma$-MWIF where
$\gamma = (1 - \alpha) \cdot \epsilon = \frac{\epsilon}{2^{10}}$.

Achieving a Desired Confidence. We combine several independent fingerprints to increase the
confidence to a desired level $\delta$. Section 3 used a fingerprint of length $k$ to
achieve a confidence of $\frac{7}{8}$. Consider taking $m$ fingerprints for each stream,
each of length $k$. Given two streams, $i, j$, we have $m$ pairs of fingerprints, each
approximating $J$ with accuracy $\epsilon$, and confidence $\frac{7}{8}$. Denote the
estimators we obtain as $\hat{J}_1, \hat{J}_2, \ldots, \hat{J}_m$, and denote the median of
these values as $\hat{J}$. Consider using $m \ge \frac{32}{9} \ln\frac{1}{\delta}$ "blocks".

Theorem 5 (Median Estimator). $\Pr(|\hat{J} - J| \le \epsilon) \ge 1 - \delta$. Proof given in appendix.

Due to Theorem 5, to make sure that $|\hat{J} - J| \le \epsilon$ it suffices to take
$m \ge \frac{32}{9} \ln\frac{1}{\delta}$ fingerprints, each with
$k = \frac{8.02}{\epsilon^2}$ hashes. In total, it is enough to take
$\frac{32}{9} \ln\frac{1}{\delta} \cdot \frac{8.02}{\epsilon^2} < 28.52 \frac{\ln(1/\delta)}{\epsilon^2}$
hashes. Thus, we use $O(\frac{\ln(1/\delta)}{\epsilon^2})$ hashes, storing a single bit per hash.
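The number of blocks and the median step can be sketched as follows. The function names are hypothetical, and the constant $\frac{32}{9}$ is the one from the bound on $m$ above; the per-block estimates are assumed to be already computed.

```python
import math
from statistics import median

def num_blocks(delta):
    """Smallest integer m with m >= (32/9) * ln(1/delta)."""
    return math.ceil(32 / 9 * math.log(1 / delta))

def median_estimate(block_estimates):
    """Combine the per-block estimates J_1, ..., J_m by taking their median."""
    return median(block_estimates)

# Toy values (assumed for illustration only).
m = num_blocks(0.01)
j_hat = median_estimate([0.48, 0.52, 0.10, 0.50, 0.90])
```

The median is used rather than the mean because a minority of badly wrong block estimates, such as 0.10 and 0.90 in the toy input, does not move it.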



3 The accuracy $\gamma$ is much stronger than the overall accuracy $\epsilon$ required of
the full fingerprint, for reasons to be later examined.



4 Fast Method for Computing the Fingerprint 



We discuss speeding up the fingerprint computation. Consider computing the fingerprint for
a set of $b$ items $X = \{x_1, \ldots, x_b\}$ where $x_i \in [u]$. The fingerprint is
composed of $m$ "block fingerprints", where block $r$ is constructed using $k$ hashes
$h_1^r, \ldots, h_k^r$, built using $2(d+1)$ random coefficients in $Z_p$. The $i$'th
location in the block is the minimal item in $X$ under $h_i$:
$m_i = \arg\min_{x \in X} h_i(x)$, which is then hashed through a hash $\phi$ mapping
elements in $[u]$ to a single bit. We show how to quickly compute the block fingerprint
$(m_1, \ldots, m_k)$. A naive way to do this is applying $k \cdot b$ hashes to compute
$h_i(x_j)$ for $i \in [k], j \in [b]$. The values $h_i(x_j)$ where $i \in [k], j \in [b]$
form a matrix, where row $i$ has the values $(h_i(x_1), \ldots, h_i(x_b))$, illustrated in
Figure 1.



[Figure 1: the matrix of hash values. Row $i$ corresponds to the hash
$h_i(X) = f(X) + i \cdot g(X)$ and holds the values $h_i(x_1), h_i(x_2), h_i(x_3), h_i(x_4), \ldots$]

Fig. 1. A fingerprint "chunk" for a stream.
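For reference, the naive row-wise computation of one block can be sketched as below. It assumes $f$ and $g$ are given as coefficient lists and uses hypothetical helper names; it evaluates all $k \cdot b$ matrix entries, which is exactly the cost the rest of this section avoids.

```python
def eval_poly(coeffs, x, p):
    """Evaluate a_0 + a_1*x + ... + a_d*x^d over Z_p with Horner's rule."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % p
    return acc

def naive_block(items, f, g, k, p):
    """For each row i, return the item minimising h_i(x) = f(x) + i*g(x) mod p."""
    fx = {x: eval_poly(f, x, p) for x in items}
    gx = {x: eval_poly(g, x, p) for x in items}
    # O(k * b): every row scans every column of the matrix.
    return [min(items, key=lambda x: (fx[x] + i * gx[x]) % p) for i in range(k)]

# Toy polynomials f(x) = 3 + x and g(x) = 2x over Z_11 (assumed for illustration).
block = naive_block([1, 2, 3], f=[3, 1], g=[0, 2], k=3, p=11)
```

In the full scheme each returned item would additionally be mapped to a single bit by $\phi$.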



Once all $h_i(x_j)$ values are computed for $i \in [k], j \in [b]$, for each row $i$ we
check for which column $j$ the row's minimal value occurs, and store $m_i = x_j$, as
illustrated in the left of Figure 2. Thus, computing the fingerprint requires finding the
minimal value across the rows (or more precisely, the value $x_j$ for the column $j$ where
this minimal value occurs). To speed up the process, we use a method similar to the one
discussed in [18] as a building block. Recall the hashes $h_i$ were defined as
$h_i(x) = f(x) + i g(x)$ where $f(x), g(x)$ are $d$-degree polynomials with random
coefficients in $Z_p$. Our algorithm is based on a procedure that gets a value $x \in [u]$
and a threshold $t$, and returns all elements in $(h_0(x), h_1(x), \ldots, h_{k-1}(x))$
which are smaller than $t$, as well as their locations. Formally, the method returns the
index list $I_t = \{i \mid h_i(x) < t\}$ and the value list
$V_t = \{h_i(x) \mid i \in I_t\}$ (note these are lists, so the $j$'th location in $V_t$,
$V_t[j]$, contains $h_{I_t[j]}(x)$). We call this the column procedure, and denote by
pr-small-loc$(f(x), g(x), k, x, t)$ the function that returns $I_t$, and by
pr-small-val$(f(x), g(x), k, x, t)$ the function that returns $V_t$. We describe a certain
implementation of these operations in Section 4.1. The running time of this implementation
is $O(\log k + |I_t|)$, rather than the naive algorithm which evaluates $O(k)$ hashes.



Thus, this procedure quickly finds small elements across columns (where by "small" 
we mean smaller than t). This is illustrated on the right of Figure [2] 




Fig. 2. Finding small elements across columns rather than minimal elements across 
rows 

Roughly speaking, our algorithm maintains a bound for the minimal value for each 
row, and operates by going through the columns, finding the small values in each of 
them, and updating the bounds for the rows where these occur. 

block-update$((x_1, \ldots, x_b), f(x), g(x), k, t)$:

1. Let $m_i = \infty$ for $i \in [k]$

2. Let $p_i = 0$ for $i \in [k]$

3. For $j = 1$ to $b$:

(a) Let $I_t$ = pr-small-loc$(f(x), g(x), k, x_j, t)$

(b) Let $V_t$ = pr-small-val$(f(x), g(x), k, x_j, t)$

(c) For $y = 1$ to $|I_t|$: // Positions of the small elements in the lists

i. If $m_{I_t[y]} > V_t[y]$: // Update to row $I_t[y]$ required
A. $m_{I_t[y]} = V_t[y]$

B. $p_{I_t[y]} = x_j$
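The block update can be sketched in runnable form as follows. To keep the example self-contained, the column procedure is replaced by a naive $O(k)$ stand-in (`small_entries_naive`, a hypothetical name); the callables `f_of` and `g_of` are toy stand-ins for evaluating $f$ and $g$.

```python
import math

def small_entries_naive(a, b, k, p, t):
    """Stand-in column procedure: indices and values with (a + i*b) mod p < t.
    Runs in O(k); the procedure of Section 4.1 is meant to replace it."""
    idx, vals = [], []
    for i in range(k):
        v = (a + i * b) % p
        if v < t:
            idx.append(i)
            vals.append(v)
    return idx, vals

def block_update(items, f_of, g_of, k, p, t):
    """Return (m, arg): per-row minimal value below t and the item achieving it."""
    m = [math.inf] * k      # smallest value seen so far in each row
    arg = [None] * k        # item (column) where that value occurs
    for x in items:
        idx, vals = small_entries_naive(f_of(x), g_of(x), k, p, t)
        for i, v in zip(idx, vals):
            if v < m[i]:
                m[i], arg[i] = v, x
    return m, arg

# Toy hashes over Z_11 (assumed for illustration only).
p, k = 11, 3
f_of = lambda x: (3 + x) % p
g_of = lambda x: (2 * x) % p
mins, args = block_update([1, 2, 3], f_of, g_of, k, p, t=3)
```

Row 0 stays at infinity in this toy run because none of its values falls below the threshold $t = 3$, which illustrates why $t$ must not be chosen too small.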

If our method updates $m_i, p_i$ for row $i$, once the procedure is done, $m_i$ indeed
contains the minimal value in that row, and $p_i$ the column where this minimal value
occurs, since if even a single update occurred then the row indeed contains an item that is
smaller than $t$, so the minimal item in that row is smaller than $t$ and an update would
occur for that item. On the other hand, if all the items in a row are bigger than $t$, an
update would not occur for that row. The running time of the column procedure is
$O(\log k + |I_t|)$, which is a random variable that depends on the number of elements
returned for that column, $|I_t|$. Denote by $L_j$ the number of elements returned for
column $j$ (i.e. $|I_t|$ for column $j$). Since we have $b$ columns, the running time of
the block update is $O(b \log k) + O(\sum_{j=1}^{b} L_j)$. The total number of returned
elements is $\sum_{j=1}^{b} L_j$, which is the total number of elements that are smaller
than $t$. We denote by $Y_t = \sum_{j=1}^{b} L_j$ the random variable which is the number
of all elements in the block that are smaller than $t$. The running time of our block
update is thus $O(b \log k + Y_t)$.



The random variable $Y_t$ depends on $t$, since the smaller $t$ is, the fewer elements are
returned and the faster the column procedure runs. On the other hand, we only update
rows whose minimal value is below t, so if t is too low we have a high probability 
of having rows which are not updated correctly. We show that a certain compromise 
t value allows achieving both a good running time of the block update, with a good 
probability of correctly computing the values for all the rows. 

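Theorem 6. Given the threshold $t = \frac{12 l' p}{b}$, where
$l' = 80 + 2 \log\frac{1}{\epsilon^2}$ (so $l' = O(\log\frac{1}{\epsilon})$), the runtime
of the block-update procedure is
$O(b \log\frac{1}{\epsilon} + \frac{1}{\epsilon^2} \log\frac{1}{\epsilon})$.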

Proof. Recall that to get a 7-MWIF (for 7 = ^rn-e) we used d — 0(log -) as the 
degree of the random polynoms /, g in the base random construction, used to compose 
the hi, . . . , hk hashes. Examining the constant in the work of Indyk [ 12] shows that the 
requirement is d > 80 + 2 log -. Denote /' = 80 + 2 log =, Due to our choice of d we 
have d > I', so the hashes h\,...,hf, were effectively chosen at random from an T-wise 
independent family. Let H be an V — wise independent family of hashes. Consider the 
following equation from [12|, regarding E t , the expected number of elements x G X 
such that h(x) < t (i.e. elements that are smaller than t under h chosen at random from 

H): Pr[min xex h{x) > t] < 48 (^) ( * ~ 1)/2 . 

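When computing the fingerprint for the elements in $X$, we know $|X|$ (see footnote 4) and
denoted $|X| = b$. Each $h_i$ is $\gamma$-MWIF, so $E_t = \frac{t b}{p}$. Now consider
choosing $t = \frac{12 l' p}{b}$. Under this choice (see footnote 5) of $t$ we have
$E_t = \frac{t b}{p} = 12 l'$, and using the fact that
$l' = 80 + 2 \log\frac{1}{\epsilon^2}$ the above lemma can be rewritten as:

$$\Pr[\min_{x \in X} h(x) > t] \le 48 \left(\frac{1}{2}\right)^{(l'-1)/2} = 48 \cdot \left(\frac{1}{2}\right)^{39.5} \cdot \left(\frac{1}{2}\right)^{2 \log\frac{1}{\epsilon}} = 48 \cdot 2^{-39.5} \cdot \epsilon^2$$

There are $k$ rows, and by applying the union bound we obtain:

$$\Pr[\exists i \in [k] : \min_{x \in X} h_i(x) > t] \le k \cdot 48 \cdot 2^{-39.5} \cdot \epsilon^2 = 8.02 \cdot 48 \cdot 2^{-39.5} < 2^{-29}$$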

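We prove our algorithm runs in time
$O(b \log\frac{1}{\epsilon} + \frac{1}{\epsilon^2} \log\frac{1}{\epsilon})$ with high
probability. We have $kb$ random values, $h_0(x_1), \ldots, h_{k-1}(x_b)$, which are (at
least) pairwise independent (see footnote 6). Denote by $Y_{i,j}$ the indicator variable of
the event that $h_j(x_i) < t = \frac{12 l' p}{b}$, so
$\Pr[Y_{i,j} = 1] = \frac{12 l'}{b}$ and $E[Y_{i,j}] = \frac{12 l'}{b}$. Then
$Y = \sum_{i=1}^{b} \sum_{j=0}^{k-1} Y_{i,j}$. The running time of the algorithm is
$O(b \log\frac{1}{\epsilon} + Y)$. We show that
$Y = O(\frac{1}{\epsilon^2} \log\frac{1}{\epsilon})$ with high probability. We obtain:
$E[Y] = E[\sum_{i=1}^{b} \sum_{j=0}^{k-1} Y_{i,j}] = \sum_{i=1}^{b} \sum_{j=0}^{k-1} E[Y_{i,j}] = \sum_{i=1}^{b} \sum_{j=0}^{k-1} \frac{12 l'}{b} = 12 \cdot l' \cdot k$.
We use the following lemma, proven in the appendix: $Var(Y) \le E(Y)$. Using Chebychev's
inequality we obtain:
$\Pr[Y > 11 E(Y)] \le \Pr[|Y - E(Y)| > 10 E(Y)] \le \frac{Var(Y)}{100 E(Y)^2} \le \frac{1}{100}$.
To guarantee the required run time in a worst case analysis, we can drop all the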



48 



4 We use this assumption for simplicity. If we don't know \X\, we can update the threshold 

log - 

t online. We store all elements until we have e2 s elements. Then we set t according to 

b = 2 °% 7 ■ We double 6 by 2 each time |Jf| > 6 and update t according to the new b. 
5 Notice that this constant is only needed to bound the worst case; usually the maximum of
the minimal values in a block is about $l'$. Moreover, we can improve the running time if
we drop from the sketch all the hash functions whose minimal value is too big.
6 We base our calculation on the pairwise independence of the $Y_{i,j}$. Notice that
$Y_{i,j}$ is more independent when running over $i$. Therefore in practice the constants
are smaller.



blocks which take too long to compute. This reduces our probability of success in each
block from $\frac{7}{8}$ to at least $\frac{7}{8} - 2^{-29} - \frac{1}{100}$ (the $2^{-29}$
term is due to the probability that there exists a hash that gets a minimum value higher
than $t$). Taking $4 \log\frac{1}{\delta}$ blocks still obtains this probability. Overall
the algorithm runs in time
$O(b \log\frac{1}{\epsilon} + \frac{1}{\epsilon^2} \log\frac{1}{\epsilon})$ per block, or
$O(\log\frac{1}{\delta} (b \log\frac{1}{\epsilon} + \frac{1}{\epsilon^2} \log\frac{1}{\epsilon}))$
for all blocks.

4.1 Computing The Minimal Elements of the Pseudo-Random Series 

We give a recursive implementation of pr-small-loc$(f(x), g(x), k, x, t)$ and
pr-small-val$(f(x), g(x), k, x, t)$, the procedures for computing $V_t$ and $I_t$. Recall
the hashes $h_i$ were defined as $h_i(x) = f(x) + i g(x)$ where $f(x), g(x)$ are
$d$-degree polynomials with random coefficients in $Z_p$. Consider a given element
$x \in Z_p$ for which we attempt to find all the values (and indices) in
$(h_0(x), h_1(x), \ldots, h_{k-1}(x))$ smaller than $t$. Given $x$, we can evaluate
$f(x), g(x)$ in time $O(d) = O(\log\frac{1}{\gamma})$ (see footnote 7), and denote
$a = f(x) \in Z_p$ and $b = g(x) \in Z_p$. Thus, we seek all values in
$\{a \bmod p, (a + b) \bmod p, (a + 2b) \bmod p, \ldots, (a + (k-1)b) \bmod p\}$ smaller
than $t$, and the indices $i$ where they occur. Consider the series
$S = (s_0, \ldots, s_{k-1})$ where $s_i = (a + ib) \bmod p$ and
$i \in \{0, 1, \ldots, k-1\}$. We denote the arithmetic series $(a + bi) \bmod p$ for
$i \in \{0, 1, \ldots, k-1\}$ as $S(a, b, k, p)$, so under this notation $S = S(a, b, k, p)$.

Given a value we can find the index where it occurs, and vice versa. To compute the value
for index $i$, we compute $(a + ib) \bmod p$. To compute the index $i$ where a value $v$
occurs, we solve $v = a + ib$ in $Z_p$ (i.e. $i = (v - a) \cdot b^{-1} \bmod p$). This can
be done in $O(\log p)$ time using Euclid's algorithm. Note we compute $b^{-1}$ in $Z_p$
only once to transform all values to generating indices (see footnote 8). We call a
location $i$ where $s_i < s_{i-1}$ a flip location. The first index is a flip location if
$(a - b) \bmod p > a$. First, consider the case $b < \frac{p}{2}$. If $s_i$ is a flip
location, we have $s_{i-1} < p$ but $s_{i-1} + b \ge p$, so $s_i < b$. Also, since
$b < \frac{p}{2}$ there is at least one location which is not a flip location between any
two flip locations. Given $S = S(a, b, k, p)$, denote by $f(S)$ the flip locations in $S$.

Lemma 3 (Flip Locations Are Small). When $b < \frac{p}{2}$, at most $\frac{k}{2}$ elements
are flip locations, and all elements that are smaller than $b$ are flip locations.

Proof. Note that the non-flip locations between any two flip locations are monotonically
increasing. Any flip location has a value of at most $b$, since the element before a flip
location is smaller than $p$ (modulo $p$), and adding $b$ to it exceeds $p$, but through
this addition it is impossible to exceed $p$ by more than $b$.
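The statement of Lemma 3 can be checked by brute force on a small series. The sketch below assumes $0 \le a < p$ and $0 < b < \frac{p}{2}$, and uses hypothetical helper names; index 0 is treated as a flip location exactly when $(a - b) \bmod p > a$, as defined above.

```python
def series(a, b, k, p):
    """The series S(a, b, k, p): s_i = (a + i*b) mod p for i = 0..k-1."""
    return [(a + i * b) % p for i in range(k)]

def flip_locations(a, b, k, p):
    """Indices i with s_i < s_{i-1}; index 0 counts if (a - b) mod p > a."""
    s = series(a, b, k, p)
    flips = [0] if (a - b) % p > a else []
    flips += [i for i in range(1, k) if s[i] < s[i - 1]]
    return flips

# Toy parameters (assumed for illustration only).
a, b, k, p = 3, 5, 10, 17
s = series(a, b, k, p)
flips = flip_locations(a, b, k, p)
small = [i for i in range(k) if s[i] < b]   # indices of elements below b
```

In this toy series the elements below $b$ sit exactly at the flip locations, and fewer than half of the indices are flip locations.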

We denoted by $f(S)$ the flip locations of $S$. Denote $f_0(S) = f(S)$. Denote by
$f_1(S)$ all elements that occur directly after a flip location, $f_2(S)$ all elements that occur
7 Using multipoint evaluation we can calculate it in amortized time 0(log 2 log -). Moreover 
we can use other constructions for d-wise independent which can be evaluate in O(l) time in 
the cost of using more space. 

8 We can store a table of inverse to further reduce processing time. If the required memory for 
the table is unavailable, we can do the computation in F p c for smaller p and store table of 
size p and then calculating the inverse requires Oic log c) time. Notice that we can easily take 
c < log , og i u which will probably be less then log - 



exactly two places after the closest flip locations (i.e. they cannot be flip locations) and
by $f_i(S)$ all elements that occur $i$ places after the closest flip location.

Lemma 4 (Element Comparison). When $b < \frac{p}{2}$, if $x \in f_i(S)$ and
$y \in f_j(S)$ where $i > j$, then $x > y$.

Proof. All flip locations have a value of at most $b$. Due to Lemma 3, a location directly
after a flip location is not a flip location, and is thus bigger than the flip location
before it by exactly $b$, and is thus greater than $b$. Thus any element in $f_1(S)$ must
be greater than any element in $f_0(S)$. Using the same argument, we see that any element
in $f_2(S)$ is greater than any element in $f_1(S)$ and so on. A simple induction completes
the proof.

The first flip location is $\lceil \frac{p-a}{b} \rceil$, as to exceed $p$ we add $b$
$\lceil \frac{p-a}{b} \rceil$ times. Also, the number of flip locations is
$\lfloor \frac{a + bk}{p} \rfloor$. Denote the first flip location as
$j = \lceil \frac{p-a}{b} \rceil$, with value $a' = (a + jb) \bmod p$. Denote
$b' = (b - p) \bmod b$ and the number of flip locations as
$k' = \lfloor \frac{a + bk}{p} \rfloor$. The flip locations are known to also be an
arithmetic progression [18] (see footnote 9).

Lemma 5 (Flip Locations Arithmetic Progression). The flip locations of $S = S(a, b, k, p)$
are also an arithmetic progression $S' = S(a', b', k', b)$.
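Lemma 5 can be illustrated by comparing the observed flip values with the progression $a', a' + b', a' + 2b', \ldots$ taken modulo $b$. The sketch assumes $0 \le a < p$ and $0 < b < \frac{p}{2}$, ignores index 0, and takes the number of flips from the observed series rather than from the formula for $k'$; the function names are hypothetical.

```python
def flip_values(a, b, k, p):
    """Values s_i at the flip locations i >= 1 of S(a, b, k, p)."""
    s = [(a + i * b) % p for i in range(k)]
    return [s[i] for i in range(1, k) if s[i] < s[i - 1]]

def predicted_flip_values(a, b, p, count):
    """First `count` terms of the progression with start a' and step b', mod b."""
    j = -(-(p - a) // b)        # integer ceil((p - a) / b): first flip index
    a1 = (a + j * b) % p        # a' : value at the first flip location
    b1 = (b - p) % b            # b' : step between consecutive flip values
    return [(a1 + n * b1) % b for n in range(count)]

# Toy parameters (assumed for illustration only).
observed = flip_values(3, 5, 20, 17)
predicted = predicted_flip_values(3, 5, 17, len(observed))
```

Each flip value is obtained from the previous one by adding multiples of $b$ and subtracting $p$ once, so modulo $b$ the step is $-p$, which is the same as $(b - p) \bmod b$.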

Given the above lemmas, we can search for the elements smaller than $t$ by examining the
flip locations series recursively. In the case $b < t$, given
$q = \lfloor \frac{t}{b} \rfloor$, due to Lemma 4, $f(S), f_1(S), \ldots, f_{q-1}(S)$ are
smaller than $t$, and all of their elements must be returned. We must also scan $f_q(S)$
and return all the elements of $f_q(S)$ which are smaller than $t$. This additional scan
requires $O(|f_q(S)|)$ time, where $|f_q(S)| \le |f_{q-1}(S)|$. Thus this case of $b < t$
examines $O(|I_t|)$ elements. Due to Lemma 3, if $b > t$, all non-flip locations are bigger
than $b$ and thus bigger than $t$, and thus we must only consider the flip locations as
candidates. Using Lemma 5 we can scan the flip locations recursively by examining the
arithmetic series of the flip locations. If at most half of the elements in each recursion
are flip locations, this results in a logarithmic running time. However, if $b$ is high,
more than half the elements are flip locations. For the case where $b > \frac{p}{2}$ we can
examine the same flip-location series $S'$, in reverse order. The first element in the
reversed series would be the last element of the current series, and rather than
progressing in steps of $b$, we progress in steps of $p - b$. This way we obtain exactly
the same elements, but in reverse order. However, in this reversed series, at most half the
elements are flip locations. The following procedure implements the above method. It finds
elements smaller than $t$ in time
$O(\log k + |I_t|) = O(\log\frac{1}{\epsilon} + |I_t|)$, where $|I_t|$ is the number of
such values. Given the returned indices, we get the values in them. We use the same $b$ for
all $|I_t|$, so this can be done in time $O(c \log c + |I_t|)$ (usually $c$ is a constant).
9 See Lemma 2 page 11.

ps-min$(a, b, p, k, t)$:

1. if $b < t$:

(a) $V_t = [\,]$

(b) if $a < t$ then $V_t = V_t + [a + ib$ for $i$ in range($\lceil \frac{t-a}{b} \rceil$)]

(c) $j = \lceil \frac{p-a}{b} \rceil$ // First flip (excluding first location)

(d) while $j < k$:

i. $v = (a + jb) \bmod p$

ii. while $j < k$ and $v < t$:

A. $V_t$.append($v$)

B. $j = j + 1$

C. $v = v + b$

iii. $j = j + \lceil \frac{p-v}{b} \rceil$ // next flip location

(e) return $V_t$

2. if $b > \frac{p}{2}$ then return ps-min$((a + (k-1) \cdot b) \bmod p, p - b, p, k, t)$

3. $j = \lceil \frac{p-a}{b} \rceil$

4. $newk = \lfloor \frac{a + bk}{p} \rfloor$

5. if $a < b$ then $j = 0$ and $newk = newk + 1$ // calculate the first flip location and
the number of flip locations

6. return ps-min$((a + jb) \bmod p, -p \bmod b, b, newk, t)$
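The run structure used above (values grow by $b$ until they wrap around $p$) already allows whole runs to be skipped without touching every index. The following is a simplified, non-recursive sketch of that idea and not the recursive procedure itself: it spends one step per wrap-around plus one per reported element, rather than $O(\log k + |I_t|)$. The name `small_values_by_runs` and the assumption $1 \le t \le p$ are illustrative.

```python
def small_values_by_runs(a, b, p, k, t):
    """Return [(i, s_i)] for all i < k with s_i = (a + i*b) mod p < t.
    Assumes 1 <= t <= p. Skips from one wrap-around to the next."""
    a %= p
    b %= p
    if b == 0:                      # constant series: every value equals a
        return [(i, a) for i in range(k)] if a < t else []
    out = []
    i, v = 0, a                     # v is the value at index i (start of a run)
    while i < k:
        run_len = -(-(p - v) // b)  # ceil((p - v) / b): steps until the wrap
        if v < t:
            below = -(-(t - v) // b)    # number of run elements still below t
            for n in range(min(below, run_len, k - i)):
                out.append((i + n, v + n * b))
        i += run_len                # jump to the next flip location
        v = v + run_len * b - p     # value right after the wrap (< b)
    return out

def small_values_brute(a, b, p, k, t):
    """Reference implementation that inspects all k values."""
    return [(i, (a + i * b) % p) for i in range(k) if (a + i * b) % p < t]

# Toy parameters (assumed for illustration only).
found = small_values_by_runs(3, 5, 17, 10, 6)
```

When $b$ is close to $p$ almost every index is a wrap-around and this sketch degrades towards $O(k)$; the reversal step and the recursion on the flip series are what remove that dependence in the procedure above.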

5 Conclusions 

We have presented a fast method for computing fingerprints of massive datasets, based 
on pseudo-random hashes. We note that although we have examined the Jaccard similarity
in detail, the exact same technique can be used for any fingerprint which is based
on minimal elements under several hashes. Thus we have described a general technique 
for exponentially speeding up computation of such fingerprints. Our analysis has used 
fingerprints using a single bit per hash. We have shown that even for these small finger- 
prints which can be quickly computed, the required number of hashes is asymptotically 
similar to previously known methods, and is logarithmic in the required confidence and 
polynomial in the required accuracy. Several directions remain open for future research. 
Can we speed up the fingerprint computation even further? Can similar techniques be 
used for computing fingerprints that are not based on minimal elements under hashes? 
6 Appendix: Proofs

The proof of Theorem 1: $\Pr_{h \in H}[m_i^h = m_j^h] = J_{i,j}$

Proof. Denote $J = J_{i,j}$. The set $C_i \cup C_j$ contains three types of items: items
that appear only in $C_i$, items that appear only in $C_j$, and items that appear in
$C_i \cap C_j$. When an item in $C_i \cap C_j$ is minimal under $h$, i.e., for some
$a \in C_i \cap C_j$ we have $h(a) = \min_{x \in C_i \cup C_j} h(x)$, we get that
$\min_{x \in C_i} h(x) = \min_{x \in C_j} h(x)$. On the other hand, if for some
$a \in C_i \cup C_j$ such that $a \notin C_i \cap C_j$ we have
$h(a) = \min_{x \in C_i \cup C_j} h(x)$, the probability that
$\min_{x \in C_i} h(x) = \min_{x \in C_j} h(x)$ is negligible (see footnote 10). Since $H$
is MWIF, any element in $C = C_i \cup C_j$ is equally likely to be minimal under $h$.
However, only elements in $I = C_i \cap C_j$ would result in $m_i^h = m_j^h$. Thus
$\Pr_{h \in H}[m_i^h = m_j^h] = \frac{1}{|C_i \cup C_j|} \cdot |C_i \cap C_j| = \frac{|C_i \cap C_j|}{|C_i \cup C_j|} = J_{i,j}$.

The proof of Theorem 4 (Simple Estimator for Jaccard With Single Bit Per Hash):

$\Pr(|Y - J| \le \epsilon) \ge \frac{7}{8}$



10 Such an event requires that two different items, $x_i \in C_i$ and $x_j \in C_j$, would
be mapped to the same value $h^* = h(x_i) = h(x_j)$, and that this value would also be the
minimal value obtained when applying $h$ to all the items both in $C_i$ and in $C_j$. As
discussed in [12], the probability for this is negligible when the range of $h$ is large
enough.



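Proof. Our proof uses Chebychev's inequality:

$$\Pr(|X - E(X)| \ge \epsilon) \le \frac{Var(X)}{\epsilon^2}$$

We have:

$$E(X) = E(\sum_{i=1}^{k} X_i) = \sum_{i=1}^{k} E(X_i) = k \cdot (J \pm \gamma)$$

$$(J - \gamma) \le E(Y) \le (J + \gamma)$$

We now bound $Var(X)$, using the pairwise independence of the hashes to replace
$E(X_i X_j)$ by $E(X_i) E(X_j)$:

$$Var(X) = E(X^2) - E^2(X)$$

$$= E((\sum_{i=1}^{k} X_i)^2) - E^2(\sum_{i=1}^{k} X_i)$$

$$= E(\sum_{i=1}^{k} X_i^2 + 2 \sum_{i<j} X_i X_j) - (\sum_{i=1}^{k} E(X_i))^2$$

$$= \sum_{i=1}^{k} E(X_i^2) + 2 \sum_{i<j} E(X_i X_j) - (\sum_{i=1}^{k} E(X_i)^2 + 2 \sum_{i<j} E(X_i) E(X_j))$$

$$= \sum_{i=1}^{k} E(X_i^2) + 2 \sum_{i<j} E(X_i) E(X_j) - (\sum_{i=1}^{k} E(X_i)^2 + 2 \sum_{i<j} E(X_i) E(X_j))$$

$$= \sum_{i=1}^{k} E(X_i^2) - \sum_{i=1}^{k} E(X_i)^2 \le k \qquad (1)$$

We use this to bound $Var(Y)$:

$$Var(Y) = Var(\frac{1}{k} \cdot X) = \frac{1}{k^2} Var(X) \le \frac{1}{k^2} \cdot k = \frac{1}{k}$$

Using Chebychev's inequality we get that:

$$\Pr(|Y - E(Y)| \ge \beta) \le \frac{Var(Y)}{\beta^2} \le \frac{1}{k \cdot \beta^2}$$

Denote $\alpha = \frac{2^{10} - 1}{2^{10}}$. Let $\beta = \alpha \cdot \epsilon$, so we obtain:

$$\Pr(|Y - E(Y)| \ge \alpha \epsilon) \le \frac{1}{k \cdot \alpha^2 \epsilon^2}$$

Thus using our choice of $k = \frac{8.02}{\epsilon^2}$ and $\beta = \alpha \cdot \epsilon$
(and noting that $J \le 1$, $\epsilon \le 1$) we have:

$$\Pr(|Y - E(Y)| \ge \beta) \le \frac{1}{k \beta^2} = \frac{1}{8.02 \cdot \alpha^2} \le \frac{1}{8}$$

Since $|E(Y) - J| \le \gamma = (1 - \alpha) \epsilon$, whenever $|Y - E(Y)| < \alpha \epsilon$
we also have $|Y - J| \le \alpha \epsilon + (1 - \alpha) \epsilon = \epsilon$.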

Proof of Theorem 5 (Median Estimator for Jaccard): $\Pr(|\hat{J} - J| \le \epsilon) \ge 1 - \delta$.

Proof. We use Hoeffding's inequality [11]. Let $X_1, \ldots, X_n$ be independent random
variables, where all are bounded so that $X_i \in [a_i, b_i]$, and let
$X = \sum_{i=1}^{n} X_i$. Hoeffding's inequality states that:

$$\Pr(X - E[X] \ge n\epsilon) \le \exp\left(-\frac{2 n^2 \epsilon^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\right)$$

We say that the estimator $\hat{J}_i$ is good if $|\hat{J}_i - J| \le \epsilon$ and that
$\hat{J}_i$ is bad if $|\hat{J}_i - J| > \epsilon$. Each estimator $\hat{J}_i$ is bad with
probability of $p \le \frac{1}{8}$. Consider the random variable $X_i$ where $X_i = 1$ if
$\hat{J}_i$ is bad, and $X_i = 0$ if $\hat{J}_i$ is good. We have
$\Pr(X_i = 1) = p \le \frac{1}{8}$, so $E(X_i) = p \le \frac{1}{8}$. Denote
$X = \sum_{i=1}^{m} X_i$, so $E(X) = m \cdot p \le m \cdot \frac{1}{8}$. We now note that
$\hat{J}$ can be bad only if at least half the estimators $\hat{J}_1, \ldots, \hat{J}_m$
are bad, or in other words, when $X \ge \frac{m}{2}$.

The $X_i$'s are independent, since for any $x, y$ the hashes used to obtain $\hat{J}_x$
are independent of the hashes used to obtain $\hat{J}_y$. Since $p \le \frac{1}{8}$ we have:

$$\Pr(X \ge \frac{m}{2}) \le \Pr(X \ge (\frac{3}{8} + p) \cdot m) = \Pr(X - mp \ge \frac{3}{8} m)$$

However, $E(X) = mp$, so using Hoeffding's inequality, we require that
$\Pr(X \ge \frac{m}{2}) \le \delta$:

$$\Pr(X \ge \frac{m}{2}) \le \Pr(X - mp \ge \frac{3}{8} m) \le \exp(-2m \cdot \frac{9}{64}) \le \delta$$

Extracting $m$ we obtain that we require:

$$m \ge \frac{32}{9} \ln\frac{1}{\delta}$$



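Proof of the lemma in Theorem 6

Lemma 6. Let $Y = \sum_{i=1}^{b} \sum_{j=0}^{k-1} Y_{i,j}$ as in Theorem 6. Then $Var[Y] \le E[Y]$.

Proof. Below, $\sum_{i,j}$ abbreviates $\sum_{i=1}^{b} \sum_{j=0}^{k-1}$, and
$\sum_{\text{pairs}}$ runs over unordered pairs of distinct index pairs $(i,j) \ne (i',j')$.
The $Y_{i,j}$ are pairwise independent indicator variables, so
$E[Y_{i,j} Y_{i',j'}] = E[Y_{i,j}] E[Y_{i',j'}]$ and $Y_{i,j}^2 = Y_{i,j}$.

$$Var[Y] = E[Y^2] - E^2[Y] = E[(\sum_{i,j} Y_{i,j})^2] - E^2[\sum_{i,j} Y_{i,j}]$$

$$= E[\sum_{i,j} Y_{i,j}^2 + 2 \sum_{\text{pairs}} Y_{i,j} Y_{i',j'}] - (\sum_{i,j} E[Y_{i,j}])^2$$

$$= \sum_{i,j} E[Y_{i,j}^2] + 2 \sum_{\text{pairs}} E[Y_{i,j} Y_{i',j'}] - (\sum_{i,j} E[Y_{i,j}]^2 + 2 \sum_{\text{pairs}} E[Y_{i,j}] E[Y_{i',j'}])$$

$$= \sum_{i,j} E[Y_{i,j}^2] + 2 \sum_{\text{pairs}} E[Y_{i,j}] E[Y_{i',j'}] - (\sum_{i,j} E[Y_{i,j}]^2 + 2 \sum_{\text{pairs}} E[Y_{i,j}] E[Y_{i',j'}])$$

$$= \sum_{i,j} E[Y_{i,j}^2] - \sum_{i,j} E[Y_{i,j}]^2 = \sum_{i,j} E[Y_{i,j}] - \sum_{i,j} E[Y_{i,j}]^2$$

$$= E[Y] - \sum_{i,j} E[Y_{i,j}]^2 \le E[Y] \qquad (2)$$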



